Sales prediction model for State of Connecticut Real Estate Sales 2001-2022 GL¶

The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 of one year and September 30 of the next. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial, or vacant land), sales price, and property assessment.

Source Data: https://catalog.data.gov/dataset/real-estate-sales-2001-2018

Return Home: https://johnkimaiyo.vercel.app/

Creating a prediction model using Python and Pandas involves several steps, including data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation.

Step 1: Import Necessary Libraries¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import joblib

Step 2: Load the Dataset¶

In [2]:
Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")

# Display the first few rows of the dataset
print(Real_Estate_Sales_df.head())
C:\Users\jki\AppData\Local\Temp\ipykernel_20024\2726286131.py:1: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
  Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")
   Serial Number  List Year Date Recorded     Town           Address  \
0         220008       2022    01/30/2023  Andover       618 ROUTE 6   
1        2020348       2020    09/13/2021  Ansonia   230 WAKELEE AVE   
2          20002       2020    10/02/2020  Ashford   390 TURNPIKE RD   
3         210317       2021    07/05/2022     Avon   53 COTSWOLD WAY   
4         200212       2020    03/09/2021     Avon  5 CHESTNUT DRIVE   

   Assessed Value  Sale Amount  Sales Ratio Property Type Residential Type  \
0        139020.0     232000.0       0.5992   Residential    Single Family   
1        150500.0     325000.0       0.4630    Commercial              NaN   
2        253000.0     430000.0       0.5883   Residential    Single Family   
3        329730.0     805000.0       0.4096   Residential    Single Family   
4        130400.0     179900.0       0.7248   Residential            Condo   

  Non Use Code Assessor Remarks OPM remarks  \
0          NaN              NaN         NaN   
1          NaN              NaN         NaN   
2          NaN              NaN         NaN   
3          NaN              NaN         NaN   
4          NaN              NaN         NaN   

                             Location  
0  POINT (-72.343628962 41.728431984)  
1                                 NaN  
2                                 NaN  
3  POINT (-72.846365959 41.781677018)  
4                                 NaN  
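
The DtypeWarning above is pandas telling us that some columns contain mixed types. As the message itself suggests, it can be avoided by reading the file in a single pass; a minimal sketch, reusing the same file path:

# Read the whole file at once so pandas infers one dtype per column
Real_Estate_Sales_df = pd.read_csv(
    r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv",
    low_memory=False,
)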

Step 3: Data Preprocessing¶

Before building the model, you need to preprocess the data. This includes handling missing values, converting data types, and encoding categorical variables if necessary.
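
As an illustration of that last point: this walkthrough later drops the categorical columns, but if you wanted to keep one such as Town as a model input, one-hot encoding is a common option. A minimal sketch, not used in the rest of the notebook (the encoded_df name is hypothetical):

# Hypothetical alternative: one-hot encode 'Town' instead of dropping it later
town_dummies = pd.get_dummies(Real_Estate_Sales_df['Town'], prefix='Town')
encoded_df = pd.concat([Real_Estate_Sales_df.drop(columns=['Town']), town_dummies], axis=1)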

In [4]:
# Let's check for missing values
missing_values = Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number             0
List Year                 0
Date Recorded             2
Town                      0
Address                  51
Assessed Value            0
Sale Amount               0
Sales Ratio               0
Property Type        382446
Residential Type     398389
Non Use Code         784178
Assessor Remarks     926401
OPM remarks         1084598
Location             799518
dtype: int64
In [7]:
# Drop rows with missing values in the columns below. Note: 'Assessor Remarks'
# and 'OPM remarks' are populated for only a small fraction of sales, so this
# step shrinks the dataset from roughly 1.1 million rows to the 560 rows seen
# in the .info() output further down.
Real_Estate_Sales_df.dropna(subset=['Date Recorded', 'Address', 'Property Type', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location'], inplace=True)
# Let's confirm that no missing values remain
missing_values = Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number       0
List Year           0
Date Recorded       0
Town                0
Address             0
Assessed Value      0
Sale Amount         0
Sales Ratio         0
Property Type       0
Residential Type    0
Non Use Code        0
Assessor Remarks    0
OPM remarks         0
Location            0
dtype: int64
In [5]:
# Let's check for unwanted negative values. (Note: cell In [5] was executed
# before the dropna in In [7], so the statistics below cover the full dataset.)
Real_Estate_Sales_df.describe()
Out[5]:
       Serial Number     List Year  Assessed Value   Sale Amount   Sales Ratio
count   1.097629e+06  1.097629e+06    1.097629e+06  1.097629e+06  1.097629e+06
mean    5.370357e+05  2.011218e+03    2.818016e+05  4.053146e+05  9.603926e+00
std     7.526074e+06  6.773485e+00    1.657890e+06  5.143492e+06  1.801664e+03
min     0.000000e+00  2.001000e+03    0.000000e+00  0.000000e+00  0.000000e+00
25%     3.071300e+04  2.005000e+03    8.909000e+04  1.450000e+05  4.778667e-01
50%     8.070600e+04  2.011000e+03    1.405800e+05  2.330000e+05  6.105663e-01
75%     1.703410e+05  2.018000e+03    2.282700e+05  3.750000e+05  7.707200e-01
max     2.000500e+09  2.022000e+03    8.815100e+08  5.000000e+09  1.226420e+06
In [12]:
# Let's check the data types so we can perform numerical calculations
Real_Estate_Sales_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 560 entries, 527 to 1093559
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Serial Number   560 non-null    int64  
 1   List Year       560 non-null    int64  
 2   Town            560 non-null    object 
 3   Address         560 non-null    object 
 4   Assessed Value  560 non-null    float64
 5   Sale Amount     560 non-null    float64
 6   Sales Ratio     560 non-null    float64
 7   Year            560 non-null    int32  
 8   Month           560 non-null    int32  
 9   Day             560 non-null    int32  
dtypes: float64(3), int32(3), int64(2), object(2)
memory usage: 41.6+ KB
In [15]:
print(Real_Estate_Sales_df.columns)
Index(['Serial Number', 'List Year', 'Town', 'Address', 'Assessed Value',
       'Sale Amount', 'Sales Ratio', 'Year', 'Month', 'Day'],
      dtype='object')
In [32]:
# Let's drop the columns we no longer need
columns_to_drop = ['Property Type', 'Residential Type','Non Use Code','Assessor Remarks','OPM remarks','Location','Address','Town','Serial Number','List Year']
Real_Estate_Sales_df.drop(columns=columns_to_drop, inplace=True,errors='ignore')
Real_Estate_Sales_df.head(5)
Out[32]:
      Assessed Value  Sale Amount  Sales Ratio  Year  Month  Day
527         207700.0     400000.0       0.5193  2023      9   20
922          49700.0      29900.0       1.6622  2023      2   27
940         208230.0     423261.0       0.4920  2022     10   31
1345        169050.0     351000.0       0.4816  2023      3   30
1632        769900.0    3500000.0       0.2200  2023      4   26

Step 4: Exploratory Data Analysis (EDA)¶

In [33]:
# Summary statistics
print(Real_Estate_Sales_df.describe())

# Correlation matrix
print(Real_Estate_Sales_df.corr())

# Plotting the correlation matrix
import seaborn as sns
sns.heatmap(Real_Estate_Sales_df.corr(), annot=True, cmap='coolwarm')
plt.show()
       Assessed Value   Sale Amount  Sales Ratio         Year       Month  \
count    5.600000e+02  5.600000e+02   560.000000   560.000000  560.000000   
mean     1.701261e+05  9.091623e+05     0.838064  2020.926786    6.951786   
std      1.669984e+05  1.346897e+07     1.091531     1.850397    3.215486   
min      0.000000e+00  4.500000e+03     0.000000  2017.000000    1.000000   
25%      9.025750e+04  1.100000e+05     0.413374  2020.000000    4.000000   
50%      1.416000e+05  2.140000e+05     0.625957  2021.000000    7.000000   
75%      1.992075e+05  3.752500e+05     1.020047  2022.000000   10.000000   
max      2.083410e+06  3.187900e+08    20.020000  2023.000000   12.000000   

              Day  
count  560.000000  
mean    15.817857  
std      9.149467  
min      1.000000  
25%      8.000000  
50%     17.000000  
75%     23.000000  
max     31.000000  
                Assessed Value  Sale Amount  Sales Ratio      Year     Month  \
Assessed Value        1.000000     0.048048    -0.046309  0.017759 -0.014153   
Sale Amount           0.048048     1.000000    -0.040326  0.004436 -0.000965   
Sales Ratio          -0.046309    -0.040326     1.000000 -0.008175  0.013997   
Year                  0.017759     0.004436    -0.008175  1.000000 -0.140702   
Month                -0.014153    -0.000965     0.013997 -0.140702  1.000000   
Day                  -0.026584    -0.045498     0.056925  0.012419  0.000674   

                     Day  
Assessed Value -0.026584  
Sale Amount    -0.045498  
Sales Ratio     0.056925  
Year            0.012419  
Month           0.000674  
Day             1.000000  
[Figure: correlation heatmap of the numeric features]

Step 5: Feature Engineering¶

Feature engineering involves creating new features or transforming existing ones to improve the model's performance.
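
The Year, Month, and Day columns that first appear in the In [12] output above were derived from the 'Date Recorded' column, but the cell that performed that transformation is missing from this export. A minimal sketch of what it likely looked like:

# Parse 'Date Recorded' (MM/DD/YYYY) and split it into numeric parts
Real_Estate_Sales_df['Date Recorded'] = pd.to_datetime(
    Real_Estate_Sales_df['Date Recorded'], format='%m/%d/%Y'
)
Real_Estate_Sales_df['Year'] = Real_Estate_Sales_df['Date Recorded'].dt.year
Real_Estate_Sales_df['Month'] = Real_Estate_Sales_df['Date Recorded'].dt.month
Real_Estate_Sales_df['Day'] = Real_Estate_Sales_df['Date Recorded'].dt.day
Real_Estate_Sales_df = Real_Estate_Sales_df.drop(columns=['Date Recorded'])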

In [34]:
# Display the first few rows after feature engineering
print(Real_Estate_Sales_df.head())
      Assessed Value  Sale Amount  Sales Ratio  Year  Month  Day
527         207700.0     400000.0       0.5193  2023      9   20
922          49700.0      29900.0       1.6622  2023      2   27
940         208230.0     423261.0       0.4920  2022     10   31
1345        169050.0     351000.0       0.4816  2023      3   30
1632        769900.0    3500000.0       0.2200  2023      4   26

Step 6: Splitting the Data¶

Split the data into training and testing sets.

In [35]:
# Define features (X) and target (y)
X = Real_Estate_Sales_df.drop(['Sale Amount'], axis=1)
y = Real_Estate_Sales_df['Sale Amount']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
(448, 5) (112, 5)

Step 7: Model Selection and Training¶

In [36]:
# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
Out[36]:
LinearRegression()

Step 8: Model Evaluation¶

Evaluate the model's performance on the test data.

In [37]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
Mean Squared Error: 1358146224171.0066
[Figure: scatter plot of actual vs. predicted sale amounts]
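
An MSE of about 1.36e12 is hard to interpret on its own. Taking the square root puts the error back in dollars (roughly $1.17 million here), and R-squared summarizes explained variance; a hedged follow-up sketch using the same y_test and y_pred:

from sklearn.metrics import r2_score

# Root Mean Squared Error: same units as the target (dollars)
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse:,.0f}')

# Coefficient of determination on the test set
r2 = r2_score(y_test, y_pred)
print(f'R^2: {r2:.4f}')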

Step 9: Making Predictions¶

You can now use the trained model to make predictions on new data.

In [41]:
# Example: Predict on new data
new_data = pd.DataFrame({
    'Assessed Value': [20770],
    'Sales Ratio': [0.5],
    'Year': [2026],
    'Month': [5],
    'Day': [15],
})

# Save the model to a file
joblib.dump(model, 'real_estate_sales_model.pkl')

predicted_sales = model.predict(new_data)
print(f'Predicted Total Sales: {predicted_sales[0]}')
Predicted Total Sales: 872499.7062873021
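
Because the model was saved with joblib above, it can be reloaded later, for example in a separate script, without retraining; a minimal sketch:

# Reload the persisted model and reuse it for prediction
loaded_model = joblib.load('real_estate_sales_model.pkl')
print(loaded_model.predict(new_data))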